pricingengine.estimation package¶
Submodules¶
pricingengine.estimation.double_ml module¶
-
class
pricingengine.estimation.double_ml.
DoubleML
(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)¶ Bases:
pricingengine.estimation.double_ml.DoubleMLLikeModel
Generic Double ML Model. Estimates the coefficient \(\beta\) from the following partially linear model
\(Y = f(X) + \beta \cdot D + \epsilon\)
\(D = g(X) + \mu\)
Note that the base models are cross-fit across folds (so a model’s predictions for its training data are not used).
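The cross-fitting idea in the model above can be illustrated with a minimal sketch built only on scikit-learn and NumPy. It is not the pricingengine implementation: the simulated data, the variable names, and the use of `cross_val_predict` are assumptions made for illustration. Both first-stage models are fit out-of-fold, and \(\beta\) is then estimated by regressing the outcome residuals on the treatment residuals.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Simulated data following the partially linear model (illustrative only).
rng = np.random.default_rng(0)
n, p = 2000, 10
true_beta = 2.0
X = rng.normal(size=(n, p))
D = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)             # D = g(X) + mu
Y = X[:, 0] - X[:, 2] + true_beta * D + rng.normal(size=n)   # Y = f(X) + beta*D + eps

# Two shuffled folds, mirroring the default sample_splitter shown in the signature.
splitter = KFold(n_splits=2, shuffle=True, random_state=0)

# Out-of-fold predictions: each row is predicted by a model that never saw it.
y_hat = cross_val_predict(LassoCV(), X, Y, cv=splitter)
d_hat = cross_val_predict(LassoCV(), X, D, cv=splitter)

# Second stage: OLS without a constant on the residualized variables.
y_res = Y - y_hat
d_res = D - d_hat
second_stage = LinearRegression(fit_intercept=False).fit(d_res.reshape(-1, 1), y_res)
beta_hat = float(second_stage.coef_[0])
print(f"estimated beta: {beta_hat:.3f}")
```

Because the outcome model here is fit on X alone, its residual equals \(\beta \cdot \mu + \epsilon\) up to first-stage error, which is why the residual-on-residual regression recovers \(\beta\).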
-
__init__
(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)¶ Initialize a new DoubleML instance.
Parameters: - schema – The expected schema of datasets that will be fit
- baseline_model – Instance with subclass Model to be used for computing baseline treatment and outcome prediction models in first stage regressions. This object may also be a dict which points from each column name (all treatment and outcome variables) to a corresponding Model.
- causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
- error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
- feature_builders – List of VarBuilder objects used to create features for first stage regressions
- treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
- sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
- cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
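For readers unfamiliar with clustering standard errors by date, the following is a rough NumPy sketch of a cluster-robust (sandwich) variance estimate. The helper name `cluster_robust_se` and the simulated data are hypothetical, no finite-sample correction is applied, and the estimator pricingengine actually uses is not specified in this reference.

```python
import numpy as np

def cluster_robust_se(X, y, clusters):
    """OLS coefficients and cluster-robust standard errors (no small-sample correction)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # Sum the outer products of the per-cluster score vectors.
    meat = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(clusters):
        mask = clusters == label
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Illustrative data: 50 dates with 20 observations each and a shared date-level shock.
rng = np.random.default_rng(0)
dates = np.repeat(np.arange(50), 20)
date_shock = rng.normal(size=50)[dates]
x = rng.normal(size=dates.size)
y = 1.5 * x + date_shock + rng.normal(size=dates.size)

beta, se = cluster_robust_se(x.reshape(-1, 1), y, dates)
print(beta, se)
```

Grouping the scores by date allows errors within the same date to be correlated, which is the situation the `cluster_date` flag is described as addressing.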
-
baseline_outcome_coefficients
()¶ Return coefficients (averaged over splits) from the first stage outcome regression. Will account for baseline feature scaling
-
baseline_treatment_coefficients
(treatment_name)¶ Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given treatment_name. Will account for baseline feature scaling
-
error_model
¶ Return the model used to compute predicted (absolute) error size
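The role of an error model, as described here, is to predict the typical absolute error size from features. A minimal sketch of that idea, assuming simulated residuals whose scale grows with the first feature (the data and variable names are illustrative and not taken from pricingengine):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated residuals whose spread depends on the first feature.
rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 3))
noise_scale = 0.5 + 2.0 * X[:, 0]
residuals = noise_scale * rng.normal(size=n)

# Regress absolute residuals on features to approximate the heteroskedasticity function.
error_fit = LassoCV().fit(X, np.abs(residuals))
predicted_abs_error = error_fit.predict(X)
print(error_fit.coef_)
```

The fitted coefficient on the first feature should come out positive, reflecting that errors are larger where that feature is larger.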
-
fit_baseline_models_featurized
(features, outcome, treatments, folds)¶ Fit first-stage baseline models (but not causal model) for predicting treatment and outcome. Sub-utility used by fit_baseline_models().
Parameters: - features – dictionary of features used for prediction (expects one for error too)
- outcome – dictionary of leads mapping to series of outcome leads
- treatments – double dictionary mapping from lead and treatment_name to series of treatment leads
- folds – list of train test splits used for cross validation
-
static
gen_prepredicted
(df)¶ Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models
-
static
get_rec_df_from_csv
(fname, schema)¶ Reads a csv file with recordings from a DoubleML prediction
Parameters: - fname – filename of csv of recorded model predictions
- schema – Schema object
Returns: DataFrame of prediction recordings
-
outcome_baseline_models
¶ Return the outcome baseline models
-
predict_baseline
(features, folds=None)¶ Parameters: - features – Either a single feature matrix or a dictionary:varname->feature matrix
- folds –
-
treatment_baseline_models
¶ Return the treatment baseline models
-
-
class
pricingengine.estimation.double_ml.
DoubleMLLikeModel
(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)¶ Bases:
pricingengine.estimation.regression.Estimation
An abstract baseclass for DoubleML-like models (DoubleML/DynamicDML, etc.)
-
NO_SPLIT
= 'no split'¶
-
TYPE_COL_NAME
= 'type'¶
-
__init__
(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)¶ Parameters: - schema – The expected schema of datasets that will be fit
- causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
- treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
- feature_builders – List of VarBuilder objects used to create features for first stage regressions
- sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
- cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
- no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.
-
baseline_fit_diagnostics
()¶ Get various prediction diagnostics for all baseline (first stage) regressions
-
baseline_models_feat_info
(avg_splits=False, combine_vars=False)¶ Return baseline model coefficients for all first stage models for the given lead
Parameters: - avg_splits (bool) – If True, averages diagnostics across model splits (otherwise returns them separately).
- combine_vars – Try to combine the different variable vectors into a DataFrame (works if they share the same feature vector). If True, will return a single DataFrame. If False, will return a dict:varname->DataFrame (aggregated across leads)
-
causal_model
¶ Return the causal model
-
fit_baseline_models
(estimation_dataset)¶ Fit baseline models (but not the causal model) on the DDML object
Parameters: estimation_dataset – EstimationDataSet object on which baseline models are fit
-
fit_causal_model
(estimation_dataset, rm_baseline_interm_info=False, subst_treatment_builders=None)¶ Fit only the causal model of DDML. Requires that you have already fit baseline models.
Parameters: - estimation_dataset –
- rm_baseline_interm_info – If you want to fit several different causal models, pass in rm_baseline_interm_info=False
- subst_treatment_builders – overwrites existing treatment_builders in case you want to try a different model
-
num_splits
¶ Number of splits for cross-fitting
-
pricingengine.estimation.dynamic_dml module¶
-
class
pricingengine.estimation.dynamic_dml.
BaseAndError
(leads)¶ Bases:
object
Model that will fit and predict baseline models and error
-
__init__
(leads)¶
-
baseline_fit_diagnostics
()¶
-
baseline_models_feat_info
(avg_splits=False, combine_vars=False)¶
-
error_model_predict
(features_fit)¶
-
fit_baseline_models_featurized
(common_features, lead_features, outcome_lead, treatments_lead, folds)¶
-
fit_error_model
(features_fit, err)¶
-
static
gen_prepredicted
(df)¶
-
predict_baseline
(common_features, lead_features, fold_fit_info)¶
-
-
class
pricingengine.estimation.dynamic_dml.
DDMLOptions
(min_lead=1, max_lead=1)¶ Bases:
object
Options for computing effects using Dynamic DoubleML
-
__init__
(min_lead=1, max_lead=1)¶ Create a new DDMLOptions instance.
Parameters: - min_lead – Smallest lead to model
- max_lead – Largest lead to model
-
leads
¶ Return list where each element is a number of periods ahead to compute effects
-
-
class
pricingengine.estimation.dynamic_dml.
DynamicDML
(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)¶ Bases:
pricingengine.estimation.double_ml.DoubleMLLikeModel
A series of DoubleML models. Each lead has a separate first stage model that corresponds to forecasting the outcome at that lead. There is also a common causal_model that corresponds to the causal impacts of treatments, which are jointly learned from all models.
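To make the notion of a lead concrete, the sketch below builds one forecasting target per lead by shifting the outcome forward within each panel. It uses pandas only; the column names, the toy data, and the assumption that leads run inclusively from `min_lead` to `max_lead` are illustrative and are not confirmed by this reference.

```python
import pandas as pd

# Toy panel: two items observed over four consecutive weeks.
df = pd.DataFrame({
    "item": ["a"] * 4 + ["b"] * 4,
    "week": list(range(4)) * 2,
    "units": [10, 11, 12, 13, 20, 21, 22, 23],
})

# Assumption: leads cover every integer from min_lead to max_lead inclusive.
min_lead, max_lead = 1, 2
leads = list(range(min_lead, max_lead + 1))

# For each lead, the target on a row is the outcome `lead` periods ahead in the same panel.
for lead in leads:
    df[f"units_lead_{lead}"] = df.groupby("item")["units"].shift(-lead)

print(df)
```

Rows near the end of each panel have no future observation, so their lead targets are missing; a separate first stage model per lead would be trained on the rows where its target exists.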
-
LEAD_LEVEL_NAME
= 'lead'¶
-
__init__
(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)¶ Create a new instance of an effect model.
Parameters: - schema (Schema) – The schema for subsequent training and prediction data
- baseline_model – Instance with subclass Model to be used for computing baseline treatment and outcome prediction models. This object may also be a dict which points from each column name to a corresponding Model.
- causal_model (CausalModel) – Model to be used for computing treatment effects
- error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
- feature_builders – List of VarBuilder objects used to create features for first stage regressions
- treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
- training_filter – function that takes the feature generator and estimation_dataset and returns a vector of bools which indicates which observations should be used for training. Default is all observations.
- options (DDMLOptions) – Model options
- outcome_model_type – FeatureGenerator.LEVEL_MODEL (default) trains first stage model in levels. FeatureGenerator.DIFF_MODEL trains first stage outcome model on first differences
- treatment_diff_models – list of treatments that are estimated in first differences (Default is LEVEL)
- sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
- cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
- cv_structure_fn – function that takes the df multiindex and returns labelled structure. Used by either GroupKFold or StratifiedKFold. Default is to use the time variable.
- multi_task – Bool (default False) to indicate whether one instance of the specified baseline model will be used to make predictions for multiple leads. (Only certain models have this capability, for instance CNTKCausalModel). If False, then N copies of the model will be used to model the outcome for each lead.
- no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.
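The interaction between a structure function and a grouped splitter, as described for cv_structure_fn above, can be sketched with scikit-learn's `GroupKFold`. The function name `cv_structure_by_date`, the index level names, and the choice to return integer codes are assumptions for illustration; the exact contract pricingengine expects is not documented here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy panel index: three items observed on four weekly dates.
index = pd.MultiIndex.from_product(
    [["a", "b", "c"], pd.date_range("2021-01-03", periods=4, freq="W")],
    names=["item", "date"],
)

def cv_structure_by_date(idx):
    """Hypothetical structure function: one integer label per row, derived from the date level."""
    codes, _ = pd.factorize(idx.get_level_values("date"))
    return codes

groups = cv_structure_by_date(index)
splitter = GroupKFold(n_splits=2)
folds = list(splitter.split(np.zeros(len(index)), groups=groups))

for train_idx, test_idx in folds:
    print(sorted(set(groups[train_idx])), sorted(set(groups[test_idx])))
```

With grouping by date, all rows for a given date land on the same side of each split, so a model is never evaluated on a date it was trained on.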
-
static
gen_prepredicted_baselines
(df, base_error_class=None)¶ Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models
-
get_design_matrices
(dataset)¶ Gets all the design (and related) matrices from all the stages.
Parameters: dataset – Needs to have the same schema as the estimation dataset but can be much smaller
Returns: Tuple of baseline_variables, baseline_features, train_fold, causal_outcomes, causal_treatments.
The first two are nested dictionaries of lead->varname->data. The inner datasets of the first two, and train_fold, all have the same row index, so they can be concatenated. The final two datasets are for the causal regression. The causal variables will be the original values (possibly scaled) rather than residuals. For the error model, query the first two with Model.ERROR_VAR_NAME (note: folds are meaningless here).
-
get_diffed_vars
()¶
-
get_marginal_effects
(treatment_name, competition_col, leads=None, filter_dic=None)¶
-
static
get_rec_df_from_csv
(fname, schema)¶ Reads a csv file with recordings from a DynamicDML prediction
Parameters: - fname – filename of csv of recorded model predictions
- schema – Schema object
Returns: DataFrame of prediction recordings
-
options
¶ Return the options given during initialization
-
outcome_coefficients
(lead)¶ Get first stage coefficients (averaged over splits) from outcome regression corresponding to the given lead. Will account for baseline feature scaling
Parameters: lead – integer corresponding to preferred lead
-
static
translate_prediction_to_rec
(pred_df, date_col, exp_ind=True)¶ Takes back the targets according to lead (because in fitting they are lagged to the information date)
-
static
translate_rec_to_prediction
(rec_df, leads, date_col)¶ Advances the targets according to lead (in the fitting they were lagged to the information date) and then averages across the folds of the model.
-
treatment_coefficients
(lead, treatment_name)¶ Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given lead and treatment_name. Will account for baseline feature scaling
Parameters: - treatment_name (str) – name of treatment variable
- lead (int) – preferred lead
-
-
class
pricingengine.estimation.dynamic_dml.
MultiTaskBaseAndError
(schema, baseline_model, causal_model, error_model, n_splits, leads)¶ Bases:
pricingengine.estimation.dynamic_dml.BaseAndError
-
__init__
(schema, baseline_model, causal_model, error_model, n_splits, leads)¶
-
baseline_fit_diagnostics
()¶
-
baseline_models_feat_info
(avg_splits=False, combine_vars=False)¶
-
error_model_predict
(features_fit)¶
-
fit_baseline_models_featurized
(common_features, lead_features, outcome_lead, treatments_lead, folds)¶ Parameters: - common_features – features common to all leads
- lead_features – lead-specific features
- outcome_lead – dict mapping lead to outcome variable values
- treatments_lead – dict mapping lead to treatment variable values
- folds – list of train test splits used for cross validation
-
fit_error_model
(features_fit, err)¶
-
static
gen_prepredicted
(df)¶
-
outcome_coefficients
(lead)¶
-
predict_baseline
(common_features, lead_features, fold_fit_info)¶ Parameters: - common_features – features common to all leads
- lead_features – lead-specific features
- fold_fit_info – same data format as folds variable in fit_baseline_models_featurized
-
treatment_coefficients
(lead, treatment_name)¶
-
pricingengine.estimation.estimation_dataset module¶
-
class
pricingengine.estimation.estimation_dataset.
EstimationDataSet
(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)¶ Bases:
pricingengine.estimation.typed_dataset.TypedDataSet
Dataset with known schema used for generating features
-
__init__
(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)¶ Parameters: - data – pandas dataframe containing date, units, and price columns with a single column index
- schema – Schema describing the data
- validators – A list of validators to use for verifying data integrity
- fold_fit_info – Series where each value is the index of the model that has this as the test portion or NaN if all folds can have this as test (when fit on subset of dataset). This is None if this dataset has been fit. Will be set after fit. We store this rather than folds since we can filter more easily.
-
append_data_one_instance
(panel_dic, treatments_path, start_date)¶ Returns a new estimation_dataset object with additional rows corresponding to the product specified in panel_dic and the given treatments_path. The synthetic data will begin on the start_date and carry forward at the same intervals as the rest of the data. If necessary, it will overwrite pre-existing data.
Parameters: - panel_dic – dictionary of panel values that must specify a unique instance
- treatments_path – dictionary (keyed by treatment_names) with values as iterables of numbers specifying the planned treatments of that instance week-by-week going forward. The 0th value of each iterable corresponds to the start_date
- start_date – First week in which the treatments_path is applied, i.e. the 0th value of each iterable in treatments_path specifies the treatment on the start_date. This value must be in the estimation_dataset or immediately following an observation in the estimation_dataset.
-
static
convert_folds_across_indexes
(orig_folds, orig_idx, new_idx)¶ Converts fold info from one DataFrame index to another
-
data
¶ Returns data
-
data_interval
¶ Return the temporal spacing between consecutive data points
-
filter
(filter_dic=None, first_date=None, last_date=None)¶ Returns a new estimation_dataset object which is filtered by the requirements in the filter_dic
Parameters: - filter_dic – dictionary mapping data columns to lists of allowed values
- first_date – omit any data before this date
- last_date – omit any data from after this date
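The filtering semantics described above (allowed values per column plus an optional date window) can be sketched on a plain pandas DataFrame. The helper `filter_frame` and its `date_col` argument are hypothetical stand-ins, since the real method works on an EstimationDataSet and takes the date column from its schema; the date bounds are treated as inclusive, which is how "before" and "after" read in the parameter descriptions.

```python
import pandas as pd

def filter_frame(df, date_col, filter_dic=None, first_date=None, last_date=None):
    """Return the rows of df that satisfy every requirement; bounds are inclusive."""
    mask = pd.Series(True, index=df.index)
    # Keep rows whose value in each listed column is among the allowed values.
    for column, allowed in (filter_dic or {}).items():
        mask &= df[column].isin(allowed)
    if first_date is not None:
        mask &= df[date_col] >= pd.Timestamp(first_date)
    if last_date is not None:
        mask &= df[date_col] <= pd.Timestamp(last_date)
    return df[mask]

sales = pd.DataFrame({
    "item": ["a", "a", "b", "b", "c"],
    "date": pd.to_datetime(
        ["2021-01-03", "2021-01-10", "2021-01-03", "2021-01-10", "2021-01-10"]
    ),
    "units": [5, 6, 7, 8, 9],
})

subset = filter_frame(sales, "date", filter_dic={"item": ["a", "b"]}, first_date="2021-01-10")
print(subset)
```

Like the documented method, the sketch returns a new object and leaves the input untouched.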
-
fold_fit_info
¶ Returns the folds (test part at least) for what was fit
-
static
from_df
(df, treatment_colname='treatment', outcome_colname='units', date_colname='date', is_panel_col=<function EstimationDataSet.<lambda>>, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}))¶ Create an EstimationDataSet from the given dataframe.
Parameters: - df – A pandas dataframe containing price, units, and date columns. String columns and panel columns will be interpreted as categorical columns. Float/int columns will be interpreted as numeric columns (convert a numeric column to string if it is to be interpreted as categorical).
- treatment_colname – The name of the numeric column containing treatment
- outcome_colname – The name of the numeric column containing quantities
- date_colname – The name of the datetime column containing dates
- is_panel_col – A function that takes in a column name and returns a boolean indicating if the column is used to break the dataset into panels
- validators – list of validators applied to the produced EstimationDataSet
-
gen_folds_for_new_index
(new_idx)¶ Converts fold info from this object’s index to another
-
schema
¶ Returns schema
-
set_folds_from_other_index
(other_folds, other_idx)¶ Sets this object’s fold info to that from another context (fold_info and index)
-
pricingengine.estimation.regression module¶
-
class
pricingengine.estimation.regression.
Estimation
(schema, cluster_date)¶ Bases:
object
-
__init__
(schema, cluster_date)¶
-
fit
(estimation_dataset)¶ Fit baseline and causal models on the given dataset
Parameters: estimation_dataset (EstimationDataSet) – A dataset on which to train the model
-
get_coefficients
(human_index=True)¶ Get coefficients from the causal model
Parameters: human_index – If True, then the interaction levels of the multiindex are squashed. Otherwise, they are left separate (useful for automated post-processing).
-
get_standard_errors
(human_index=True)¶ Get standard errors from the causal model
-
get_variance_matrix
(human_index=True)¶ Get variance matrix from the causal model
-
predict
(dataset, ret_pred=None)¶ Compute predictions for the given dataset using previously trained model
Parameters: - dataset (EstimationDataSet) – A dataset containing features from which to generate predictions. The schema of the dataset must match the schema of the dataset used to fit the model.
- ret_pred – Pass in an empty dataframe if you want that dataframe to be populated with predictions of the first stage models
Raises: - ValueError – If the schema of the given dataset does not match the schema given for initialization
- RuntimeError – If the model has not yet been fit
-
-
class
pricingengine.estimation.regression.
Regression
(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)¶ Bases:
pricingengine.estimation.regression.Estimation
Class for implementing estimation with VarBuilders
-
__init__
(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)¶ Initialize a new Regression instance.
Parameters:
-
error_model
¶ Return the error model
-
model
¶ Return the causal model
-
pricingengine.estimation.typed_dataset module¶
-
class
pricingengine.estimation.typed_dataset.
ColType
¶ Bases:
enum.Enum
Input IDs for known pricing data column content
-
OUTCOME
= 10¶
-
OUTCOME_RESIDUAL
= 12¶
-
PREDETERMINED
= 13¶ A description of the data contained in a single column
- The column tagged as ColType.ITEM must have DataType.CATEGORICAL
- The column tagged as ColType.OUTCOME must have DataType.NUMERIC
- The column tagged as ColType.TREATMENT must have DataType.NUMERIC
-
TREATMENT
= 9¶
-
TREATMENT_RESIDUAL
= 11¶
-
-
class
pricingengine.estimation.typed_dataset.
TypedDataSet
(data, schema, required_types)¶ Bases:
pricingengine.dataset.DataSet
Dataset class
-
__init__
(data, schema, required_types)¶ Initializes a new instance of the DataSet class. The DataSet class combines time series data with a schema that specifies the column meta-data for the given time series data.
The given data-schema pair needs to adhere to the following expectations:
Each column defined in the given schema must be contained in the corresponding given time series data
Each column must have a data type corresponding to its schema DataType as follows:
- DataType.NUMERIC: integer or floating-point
- DataType.DATE_TIME: datetime
- DataType.CATEGORICAL: string or integer
In the specified schema, the name of the column with id ITEM must also be included in the list of panel column names
Parameters: - data – The time series data to be used for computing effects
- schema – The schema specifying the meta-data for the time series
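The data-type expectations listed above can be checked on a pandas DataFrame along the following lines. This is only a sketch: the string keys standing in for DataType members, the helper `mismatched_columns`, and the plain dict used in place of a Schema object are all assumptions, and the library's own validation may differ.

```python
import pandas as pd
from pandas.api import types as ptypes

# Stand-ins for DataType.NUMERIC, DataType.DATE_TIME and DataType.CATEGORICAL.
CHECKS = {
    "NUMERIC": lambda s: ptypes.is_integer_dtype(s) or ptypes.is_float_dtype(s),
    "DATE_TIME": ptypes.is_datetime64_any_dtype,
    "CATEGORICAL": lambda s: (
        ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s) or ptypes.is_integer_dtype(s)
    ),
}

def mismatched_columns(df, expected):
    """Return names of columns that are missing or whose dtype does not match."""
    problems = []
    for name, kind in expected.items():
        if name not in df.columns or not CHECKS[kind](df[name]):
            problems.append(name)
    return problems

frame = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-03", "2021-01-10"]),
    "item": ["a", "b"],
    "units": [3, 4],
})
expected = {"date": "DATE_TIME", "item": "CATEGORICAL", "units": "NUMERIC"}
problems = mismatched_columns(frame, expected)
print(problems)
```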
-
group_labels
¶ A list parallel to the rows of the dataset with a label for each row. The labels can be passed to a Pandas groupby call to group data using known groups.
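As a small illustration of passing row-parallel labels to a pandas groupby call, the sketch below uses made-up labels and data. Converting the labels to a NumPy array first is a choice made here so that pandas treats them as per-row values rather than as column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"units": [1, 2, 3, 4]})

# One label per row, in row order (illustrative values).
group_labels = ["a", "a", "b", "b"]

# An array of labels with the same length as the frame groups rows by label value.
totals = df.groupby(np.asarray(group_labels))["units"].sum()
print(totals)
```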
-